Chapter 4 — How does one OO model a text?

Quite frankly only now comes the hard part. We have a base text, but now what. What does it mean to slow program it, to close read it through code, or to read it as code? I am not sure at all. I think I'll just take a deep breath and dive in…

The first word of our text is "Willem". What would close reading that word mean? I guess at least establishing all that is known about in a scholarly context. The word is a name. That is a linguistic observation we might want to note. It is a Dutch name, the equivalent of William. Because it is a name the Nederlandse Voornamenbank of the Meertens Institute has some general onomstic knowledge on it. For instance that it is derived from Wilhelm and thus shares it roots. Should this knowledge be modeled in? Strictly speaking the onomastics of "Willem" are from a scholarly point of view less interesting than the fact that the person of that very name was the author of the text and what we might know of that author. But what is known about Willem the author? We find ample information (and at the same time very little, because very little is factually known about Willem) in the edition of the Reynaert made by Frank Lulofs (1983).

Lulofs contends that Willem has been a professional writer (1983:44), because the following words are "die Madoc maecte". Madoc is probably another work (that now is lost and unknown to us) and we may read this as Willem advertising his professional capabilities. We do, however, have to factor in the possibility that Willem is being ironic (1983:49). A way to read the Reynaert is as a play of words and rhetoric in the most ironic possible sense: turning tropes, conventions, and the trust put in words into weapons to criticize human behavior, including their abuse of words to their own advantage. Willem might have meant his prologue as an ironic play on rhetorical conventions of the prologue (1983:197). Given his command of the French Renard tradition Willem must have been "well-educated and widely read", he "was familiar with Old French beast narratives, which provided material and inspiration, and was well-informed about legal procedures. He may have been a monk with considerable experience in worldly affairs", and the poem’s language shows that Willem came from East Flanders" (Bouwman & Besamusca 2009:15). Willem is clearly familiar with Ghent and the country to the North East stretching towards Antwerp, the Waesland where the story is located; yet other geographical locations in the now Netherlands are also mentioned (Lulofs 1983:45).

So that body of knowledge at least (but potentially a lot more, this notebook is not an exercise in exhaustiveness) must be somehow modeled into that first word. And let's not forget that indeed we should model that it is indeed the first word of the text too. Oh, and while we are at it, we should probably somehow represent that the first letter of the word is a large red capital in the Comburg manuscript.

So, how do we do this?

At this point my plan looks like this:

  1. Read in the text.
  2. Create a super class 'Word' that instantiates and links the individual words.
  3. Make it possible for any word to have a meaning and a denotation (reference).
  4. For the latter we need classes and objects for the denotation, e.g. to be a person with the name 'Willem'.
  5. Then we can adorn this person with all kinds of annotations taken from the above.

There are number of things nagging me. One is: do I keep structure information? At least I hold on to line endings, although I am not even sure why. The only reason I can come up with now is for sheer representational purposes, that is: visualizing the text in its bookish form. But of course this is an experiment to try to get away from the book metaphor. I can forget about this for now however, I can always take in more structural information later on if I need it.

The other thing bothering me most is the reproducibility of the act of getting the annotative information. Ideally the application takes that information right from the source. So in this case probably from the commentary in the editions of Lulofs (1983) and Bouwman & Besamuca (2009). The latter at least is available and downloaded, the former has been scanned by Google, but seems to be not openly available. I guess I'll scan the pages I need as a form of citation. Then I will have to extract the parts pertaining to Willem. That in itself has its conundrums on how one would do that in any sustainable fashion.

Cherry picking as a natural language processing tool

I am about to go off modeling word 1 in the text of the Reynaert. That word, 'Willem' has an awful lot of contextual knowledge connected to it. I will 'mine' that knowledge from two sources to express it within the object model of the text. I cannot just copy/paste or paraphrase the text, as that would not result in program code reproducing any of the scholarly action associated with finding and compiling the knowledge. In an ideal case I would be able to send out an agent to find all the relevant passages pertaining to Willem in the two sources I will be using. Unfortunately search engines, machine learning algorithms, and natural language processing tools are not sophisticated enough yet to mine a single book length source for such an entity and returning the related and connected phrases that are of interest. So again I will have to mimic this, similar to the case of transcribing the manuscript. I will probably just cherry pick certain fragments from the digitized texts based on character offsets. That would yield satisfying precision, but it would also be just hard coding a lot of scholarly knowledge rather than reproducing the performance. So maybe I will use some crude searching and context selecting. A computational quick and dirty approximation of looking in the literature for knowledge on Willem.

Text, Verse, Word, Person

For no particular reasons the objects that I could object model strike me as being the Text itself, each Verse of it, that contains Words, and the first word of the first verse refers to a Person. I will dive in and will create a class Text. At some I will want to visualize this stuff, thus I will add as_text() and as_json() methods, that are to reproduce the text in those respective formats.1


In [ ]:
require "json"

class Text

  attr_accessor :first_verse

  def initialize( str )
    @first_verse = Verse.new( str )
  end

  def as_text
    @first_verse.as_text
  end

  def as_json
    json = "{ \"nodes\": [ "
    json << @first_verse.as_json
    json << "] }"
    json
  end

end

The idea here is that text 'internally' at least has one graph: the linked chain that is made up by the words of the story. Therefor I want to try here to let the text model be constructed by reading in the source string by the Text object. Then that should initialize the first verse, which models the first verse. The first verse instantiates the second verse, the second the first, etc. Of course each verse instantiates also the modeling of the individual words: the first word is instantiated by the first verse, then the first words instantiates the second word, the second word the third, etc.

This is why the Text class is relatively small: it only has the task of deferring all the actual work to other objects. Class Verse by comparison, has a lot of logic. The initialize method delegates the instantiation of its words to the Word class. Each verse checks if there is still a new line (and thus a new verse) to be found. If so it delegates the instantiation of the next verse to a new verse. The as_text() and as_json() methods work in pretty much the same way: if there are more verses, they let these verses produce themselves as text or json and then add their result to their own.

This way of instantiating is based on recursion. If you do not know what this is, you may want to have a quick look at the entry on recursion in Wikipedia.


In [ ]:
class Verse

  attr_accessor :first_word
  attr_accessor :next_verse

  def initialize( str )
    m = str.match( /\n/ )
    if m != nil
      index = str.match( /\n/ ).end(0)
      @first_word = Word.new( str[ 0..index-2 ] )
      if str[ index..-1 ].size > 0
        @next_verse = Verse.new( str[ index..-1 ] )
      end
    else
      @first_word = Word.new( str )
    end
  end

  def as_text
    @first_word.as_text
    if @next_verse != nil
      print "\n"
      @next_verse.as_text
    end
  end

  def as_json( pos=0 )
    sub_json = @first_word.as_json( pos )
    json = sub_json[0]
    if @next_verse != nil
      json << ", "
      json << @next_verse.as_json(sub_json[1])
    end
    json
  end

end

Finally we arrive on the Word level. In accordance with my intention I want a word to be able to have a denotation, so that a word like "Willem" eventually may have the denotation of a Person that once existed and bore the name "Willem". Weirdly enough the scholarly most significant element of the Word class is just the place holder for the denotation (attr_accessor :denotation). The other methods maybe look arcane again, but this has mostly to do again with their recursive nature to model the text.2


In [ ]:
class Word

  attr_accessor :surface
  attr_accessor :next_word
  attr_accessor :denotation

  def initialize( str )
    m = str.match( / / )
    if m != nil
      index = str.match( / / ).end(0)
      @surface = str[ 0..index-2 ]
      @next_word = Word.new( str[ index..-1 ] )
    else
      @surface = str
    end
  end

  def as_text
    print @surface
    if @next_word != nil
      print " "
      @next_word.as_text
    end
  end

  def as_json( pos=0 )
    json = "{ \"wrd\": \"#{@surface}\", \"x\": 0, \"y\": #{pos}"
    if @denotation != nil
      json << ", \"denotation\": " << @denotation.as_json
    end
    json << " }"
    pos += 1
    if @next_word != nil
      json << ", "
      sub_json = @next_word.as_json( pos )
      json << sub_json[0]
      pos = sub_json[1]
    end
    [json,pos]
  end

end

Words have denotations, they refer to entities in (a once) reality. These denotations have import for the meaning of a text (cf. e.g. Compagnon 2004:60. I want to be able to signify these denotations. Certainly in the case of the first word of our text, which is a name "Willem". Thus we need to be able to express in the object model that this name denoted a person. Hence we create a Person class. This person class is able to 'know' (that is: to store) the name and possible annotations (knowledge or information) on a person, e.g. Willem.


In [ ]:
class Person

  attr_accessor :name
  attr_accessor :annotations

  def initialize( name )
    @name = name
  end

  def as_json
    annotations_json = "{ \"class\": \"#{Person}\", \"name\": \"#{@name}\", \"annotations\": [ "
    annotations.each do |annotation|
      annotations_json << "{ \"annotation\": #{annotation.text.to_json}, \"source\": #{annotation.source.to_json} }, "
    end
    annotations_json = annotations_json[0..-3] << " ] }"
  end

end

We now have all the object models in place that will allow us to model the first word of the text. However: how do we get our annotations on Willem? This is the topic of the next chapter.

Notes

1) Full disclosure: the as_text() method came first as a method to test whether the object model actually represented the text adequately. The as_json() method was added later when visualization required such.

2) This might very well a reason for changing the structure of the code at some point, especially the type of relation between the object and how these relations are constructed at run time. Often the model that is conceptually the closest to the domain is not the model that is from the viewpoint of clarity of coding and of feasible computationality not the best possible model. In this notebook however I am specifically interested in the closest fit to the domain model, at any cost.

</small>

References

Bouwman, A. & Besamusca, B., 2009. Of Reynaert the Fox: Text and Facing Translation of the Middle Dutch Beast Epic Van den vos Reynaerde, Amsterdam: Amsterdam University Press. Available at: http://www.oapen.org/search?identifier=340003 [Accessed November 20, 2015].

Compagnon, A., 2004. Literature, Theory, and Common Sense, Princeton and Oxford: Princeton University Press.

Lulofs, F (ed.), 1983. Willem: Van den vos Reynaerde. Groningen: Wolters–Noordhoff.

</small>


In [ ]: